Predicting the English Premier League

Tutorial by Christopher Cole and Daniel Levy

The English Premier League (EPL) is the top level of the English football system. The game of football has changed drastically over time, and countless statistics are now used to analyze it. Data science has become essential to sports teams as they look for the hidden but important statistics that can help lead to wins. Using all EPL games from the past 10 seasons, we will analyze the factors that contribute to whether a team wins or loses and identify the specific stats that most influence the outcome of a game. Our overall goal is to use these patterns and statistics to predict the next winner of the EPL.

Data Curation

Importing Libraries

In our Data Curation section, we gather all of our data and load it into pandas data frames. The CSV data comes from datahub, which has a dataset for every EPL season from the past 10 seasons. Originally, each dataset has 62 different statistics and shows the outcome of every game that season.

Our first step is to cut down the number of statistics and choose what we believe are the key statistics for predicting the league champion. We decided that the 4 most important stats for each game are the following: shots, shots on target, number of fouls, and the outcome of the game. After creating a new dataframe consisting of just these key statistics, our next step is to split the data so that each EPL team has its own dataframe. As a result, each EPL team has a dataframe per season, which consists of the result of each of its games and the key stats as additional columns.

Reading in all the data
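
A minimal sketch of this loading step, assuming each season is a CSV file whose columns follow the common football match-table naming (HomeTeam, AwayTeam, FTR, HS, AS, HST, AST, HF, AF). The column names, the season label, the helper name load_season, and the inline CSV rows are all assumptions made for illustration; in practice the text would come from the downloaded season files.

```python
import io
import pandas as pd

# Assumed column names: teams, full-time result (H/D/A), then shots,
# shots on target, and fouls for the home and away side.
KEY_COLUMNS = ["HomeTeam", "AwayTeam", "FTR", "HS", "AS", "HST", "AST", "HF", "AF"]

# Stand-in for one downloaded season file (made-up rows plus extra columns).
SAMPLE_CSV = """Date,HomeTeam,AwayTeam,FTR,HS,AS,HST,AST,HF,AF,Referee
2018-08-10,Team A,Team B,H,8,13,6,4,11,8,Ref 1
2018-08-11,Team C,Team A,D,12,10,4,1,9,9,Ref 2
2018-08-12,Team B,Team C,A,10,15,2,5,14,7,Ref 3
"""

def load_season(csv_source, season):
    """Read one season, keep only the key columns, and tag the season."""
    frame = pd.read_csv(csv_source, usecols=KEY_COLUMNS)
    return frame[KEY_COLUMNS].assign(Season=season)

# One frame per season, keyed by a hypothetical season label.
seasons = {"2018-19": load_season(io.StringIO(SAMPLE_CSV), "2018-19")}
all_games = pd.concat(list(seasons.values()), ignore_index=True)
print(all_games)
```

Passing usecols keeps the other statistics from ever being loaded, so each season frame stays small.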

Splitting each team's data in its own data frame
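
One possible way to do this split is sketched below. It assumes the same column names as a typical match table (HS/AS for shots, HST/AST for shots on target, HF/AF for fouls, FTR for the full-time result); the toy rows, the helper team_view, and the output column names are hypothetical. Each game is viewed once from the home side and once from the away side, so every team ends up with one row per game it played.

```python
import pandas as pd

# Toy matches using assumed column names.
games = pd.DataFrame({
    "HomeTeam": ["Team A", "Team C", "Team B"],
    "AwayTeam": ["Team B", "Team A", "Team C"],
    "FTR": ["H", "D", "A"],  # H = home win, D = draw, A = away win
    "HS": [8, 12, 10], "AS": [13, 10, 15],
    "HST": [6, 4, 2], "AST": [4, 1, 5],
    "HF": [11, 9, 14], "AF": [8, 9, 7],
})

def team_view(games, side):
    """Rows from one side's perspective: side is 'H' (home) or 'A' (away)."""
    team_col = "HomeTeam" if side == "H" else "AwayTeam"
    # A win when the full-time result matches this side, a draw on "D".
    result = games["FTR"].map(
        lambda r: "D" if r == "D" else ("W" if r == side else "L"))
    return pd.DataFrame({
        "Team": games[team_col],
        "Result": result,
        "Shots": games[side + "S"],
        "ShotsOnTarget": games[side + "ST"],
        "Fouls": games[side + "F"],
    })

# Stack the home and away views, then split into one frame per team.
long_form = pd.concat([team_view(games, "H"), team_view(games, "A")],
                      ignore_index=True)
team_frames = {team: frame.reset_index(drop=True)
               for team, frame in long_form.groupby("Team")}

print(team_frames["Team A"])
```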

Exploratory Data Analysis

In this graph we look at each team's average shots per game vs its winning percentage. A few teams truly stand out in average shots: Man City, Man United, Arsenal, Chelsea, Tottenham, and Liverpool. Apart from these standout teams, the bulk of the league takes about the same number of shots on average and has very similar winning percentages. Overall, there is a clear positive relationship between average shots and a team's winning percentage. Lower tier teams such as Reading, Huddersfield, and Middlesbrough have the 3 lowest shot averages and also the lowest winning percentages. Teams that take twice as many shots on average come close to having twice the winning percentage. A high number of shots per game is hugely important for a team's chances of winning.
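
The relationship described above can also be checked numerically. The sketch below uses made-up per-game rows (the team names, results, and shot counts are not from the real dataset) to compute each team's average shots and winning percentage, then the Pearson correlation between the two.

```python
import pandas as pd

# Made-up rows: one row per team per game.
games = pd.DataFrame({
    "Team":   ["Team A"] * 4 + ["Team B"] * 4 + ["Team C"] * 4,
    "Result": ["W", "W", "D", "W",  "L", "W", "D", "L",  "L", "L", "D", "L"],
    "Shots":  [18, 16, 14, 20,      12, 13, 11, 12,      8, 9, 10, 9],
})

# Per-team average shots and share of games won, as a percentage.
summary = games.groupby("Team").agg(
    avg_shots=("Shots", "mean"),
    win_pct=("Result", lambda r: (r == "W").mean() * 100),
)

# Pearson correlation between the two per-team columns.
correlation = summary["avg_shots"].corr(summary["win_pct"])
print(summary)
print(round(correlation, 3))
```

A correlation close to 1 would support the positive relationship seen in the graph, although with real data it says nothing about which way the cause runs.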

In the graph above we look at winning percentage vs average shots on target. The previous graph compared winning percentage with average shots; the key difference now is that we count only shots on target, not all shots. As in the previous graph, the top tier of teams average the most shots on target: Man City, Man United, Chelsea, Arsenal, Tottenham, and Liverpool. The lower tier of teams, Middlesbrough, Huddersfield, Hull, Cardiff, and Brighton, average the fewest shots on target. The middle tier of teams are clustered together in their average shots on target per game. There are many similarities between this graph and the previous one. For example, Man City averages both the most shots on target and the most total shots, and in both graphs they lead with the highest winning percentage.

Next we looked at the average number of fouls vs the team's winning percentage. Unlike the previous graph, there isn't a direct linear trend here. A few of the teams that committed the fewest fouls actually have the worst winning percentages, including Bournemouth, Swansea, and Cardiff. Interestingly, the top tier teams with the highest winning percentages all committed around the same number of fouls per game: Man City, Man United, Chelsea, Arsenal, Tottenham, and Liverpool. While there isn't a linear correlation between average fouls and winning percentage, there is still a clear pattern related to a team's success. As mentioned, the teams with the highest winning percentages committed a similar average number of fouls, from 10 to nearly 11.3 per game.

While the graphs above gave a good indication of the correlation between our stats and a team's success, it would be helpful to have another type of visualization that shows where most of the data is clustered. Violin plots show where most of the data is clustered, whether around the minimum, the maximum, or the average. In the next section we plotted multiple violin plots for each team, with each violin plot showing one of our key stats on the y axis against the outcome of the game on the x axis. We looked at the distribution for three different outcomes: W, which represents a win, L, which represents a loss, and D, which represents a draw.
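
A minimal sketch of one such violin plot, using matplotlib and made-up games for a single hypothetical team (the numbers and column names are assumptions, not the tutorial's data). It draws one violin of shots per outcome.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up games for one hypothetical team.
team_games = pd.DataFrame({
    "Result": ["W", "W", "W", "W", "D", "D", "D", "L", "L", "L", "L"],
    "Shots":  [18, 15, 20, 17, 12, 14, 13, 9, 11, 8, 12],
})

outcomes = ["W", "D", "L"]
# One list of shot counts per outcome, in the order shown on the x axis.
groups = [team_games.loc[team_games["Result"] == o, "Shots"].tolist()
          for o in outcomes]

fig, ax = plt.subplots()
parts = ax.violinplot(groups, showmeans=True)  # violins sit at x = 1, 2, 3
ax.set_xticks(range(1, len(outcomes) + 1))
ax.set_xticklabels(outcomes)
ax.set_xlabel("Outcome")
ax.set_ylabel("Shots")
ax.set_title("Shots by outcome (hypothetical team)")
# Call fig.savefig(...) or switch to an interactive backend to view the figure.
```

Repeating this for shots on target and fouls, and for every team, gives the full grid of plots.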

Each team's violin plot

Analysis of Violin Plots: Looking at the above plots, the overall impression is that the data is approximately normally distributed. Almost every violin has a single clear peak, and none is heavily skewed to either side. Now that we know the data appears to be roughly normally distributed, we can use our key stats more confidently going forward.

Machine Learning and Hypothesis testing

Our hypothesis is that we can accurately predict the outcome of a game using our key stats: average number of fouls, shots on target, and average shots taken. We will now test whether these 3 stats can truly predict the outcome of a game, showing which team is most likely to come out on top. We will check whether there is a linear correlation between these stats and the team winning the game.

We used the random forest algorithm to make a prediction based on average shots, average shots on target, and number of fouls. The random forest algorithm builds many different decision trees and combines them, taking a majority vote for classification or an average for regression.
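
A minimal sketch of this kind of model using scikit-learn's RandomForestClassifier. The features and W/D/L labels below are synthetic, generated only so the example runs on its own; the train/test split, the number of trees, and the random seeds are assumptions, and the accuracy it prints is unrelated to the tutorial's reported result.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_games = 300

# Synthetic stand-in features: shots, shots on target, fouls per game.
shots = rng.integers(5, 25, size=n_games)
shots_on_target = np.minimum(shots, rng.integers(1, 12, size=n_games))
fouls = rng.integers(6, 18, size=n_games)
X = np.column_stack([shots, shots_on_target, fouls])

# Synthetic labels loosely tied to shots on target, with noise added.
score = shots_on_target + rng.normal(0, 2, size=n_games)
y = np.where(score > 7, "W", np.where(score > 4, "D", "L"))

# Hold out a quarter of the games to measure accuracy on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Each of the 100 trees votes; the forest predicts the majority class.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"accuracy on held-out games: {accuracy:.3f}")
```

Measuring accuracy on held-out games rather than on the training games keeps the score from being inflated by trees that have memorized their inputs.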

The chart above is a pie graph showing the result of our algorithm. As shown, our algorithm is 71.4% accurate and 28.6% inaccurate, meaning it predicts the outcome correctly a little under three quarters of the time.

Conclusion: The English Premier League is one of the most elite sporting leagues in the world. It is incredible how many stats we are able to keep track of nowadays in the sport of soccer. Everything is being tracked, and every team wants to take full advantage of these advanced stats to increase their chances of winning. After narrowing down the numerous stats to a few key ones, we wanted to understand how each of these stats impacts the game. It was remarkable to see how much of an impact they have on the outcome of the game. These analyses can be very important for sports teams as they search for the most important stats to look at, or for the hidden gem stats that other teams aren't looking at.

Useful Links: https://www.premierleague.com/premier-league-explained

https://help.vidswap.com/hc/en-us/articles/207646186-Abbreviations-In-Soccer-Stat-Sheets

https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/

https://www.bbc.co.uk/newsround/40891247